Atomic norm denoising with applications to line spectral estimation* 



Badri Narayan Bhaskart, Gongguo Tang''", and Benjamin Recht* 
t Department of Electrical and Computer Engineering 
"Department of Computer Sciences 
University of Wisconsin-Madison 

April 2012. Last Revised Feb. 2013. 

CO 

Abstract 

o 

Motivated by recent work on atomic norms in inverse problems, we propose a new approach to 
O line spectral estimation that provides theoretical guarantees for the mean-squared-error (MSE) 

1} performance in the presence of noise and without knowledge of the model order. We propose an 

abstract theory of denoising with atomic norms and specialize this theory to provide a convex 
\Q optimization problem for estimating the frequencies and phases of a mixture of complex expo- 

nentials. We show that the associated convex optimization problem can be solved in polynomial 
time via semidefinite programming (SDP). We also show that the SDP can be approximated 
^ I by an ^-regularized least-squares problem that achieves nearly the same error rate as the SDP 

I— I but can scale to much larger problems. We compare both SDP and £i-based approaches with 

c/3 classical line spectral analysis methods and demonstrate that the SDP outperforms the t\ opti- 

i ^ i mization which outperforms MUSIC, Cadzow's, and Matrix Pencil approaches in terms of MSE 

over a wide range of signal-to-noise ratios. 

CN 
> 

<N l Introduction 

Extracting the frequencies and relative phases of a superposition of complex exponentials from 
a small number of noisy time samples is a foundational problem in statistical signal processing. 
These line spectral estimation problems arise in a variety of applications, including the direction of 
arrival estimation in radar target identification [TJ, sensor array signal processing [2] and imaging 
systems [3] and also underlies techniques in ultra wideband channel estimation [4] , spectroscopy [5] , 
molecular dynamics |6j, and power electronics n\. 
^ While polynomial interpolation using Prony's technique can estimate the frequency content of 

a signal exactly from as few as 2k samples if there are k frequencies, Prony's method is inherently 
unstable due to sensitivity of polynomial root finding. Several methods have been proposed to 
provide more robust polynomial interpolation [8f|10| (for an extensive bibliography on the subject, 
see [ll]), and these techniques achieve excellent noise performance in moderate noise. However, the 
denoising performance is often sensitive to the model order estimated, and theoretical guarantees 
for these methods are all asymptotic with no finite sample error bounds. Motivated by recent work 
on atomic norms [12] , we propose a convex relaxation approach to denoise a mixture of complex 
exponentials, with theoretical guarantees of noise robustness and a better empirical performance 
than previous subspace based approaches. 

*A preliminary version of this work appeared in the Proceedings of the 49th Annual Allerton Conference in 2011. 
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Our first contribution is an abstract theory of denoising with atomic norms. Atomic norms 
provide a natural convex penalty function for discouraging specialized notions of complexity. These 
norms generalize the t\ norm for sparse vector estimation [13] and the nuclear norm for low-rank 
matrix reconstruction 14, 15 . We show a unified approach to denoising with the atomic norm 
that provides a standard approach to computing low mean-squared-error estimates. We show how 
certain Gaussian statistics and geometrical quantities of particular atomic norms are sufficient to 
bound estimation rates with these penalty functions. Our approach is essentially a generalization 
of the Lasso 16,17 to infinite dictionaries. 

Specializing these denoising results to the line spectral estimation, we provide mean-squared- 
error estimates for denoising line spectra with the atomic norm. The denoising algorithm amounts 
to soft thresholding the noise corrupted measurements in the atomic norm and we thus refer to 
the problem as Atomic norm Soft Thresholding (AST). We show, via an appeal to the theory of 
positive polynomials, that AST can be solved using semidefinite programming (SDP) [18] , and we 
provide a reasonably fast method for solving this SDP via the Alternating Direction Method of 
Multipliers (ADMM) [19[[20| . Our ADMM implementation can solve instances with a thousand 
observations in a few minutes. 

While the SDP based AST algorithm can be thought of as solving an infinite dimensional Lasso 
problem, the computational complexity can be prohibitive for very large instances. To compensate, 
we show that solving the Lasso problem on an oversampled grid of frequencies approximates the 
solution of the atomic norm minimization problem to a resolution sufficiently high to guarantee 
excellent mean-squared error (MSE). The gridded problem reduces to the Lasso, and by leveraging 
the Fast Fourier Transform (FFT), can be rapidly solved with freely available software such as 
SpaRSA [21 1 . A Lasso problem with thousands of observations can be solved in under a second 
using Matlab on a laptop. The prediction error and the localization accuracy for line spectral 
estimation both increase as the oversampling factor increases, even if the actual set of frequencies 
in the line spectral signal are off the Lasso grid. 

We compare and contrast our algorithms, AST and Lasso, with classical line spectral algo- 



rithms including MUSIC |8|,and Cadzow's 22 and Matrix Pencil 10 methods . Our experiments 
indicate that both AST and the Lasso approximation outperform classical methods in low SNR 
even when we provide the exact model order to the classical approaches. Moreover, AST has the 
same complexity as Cadzow's method, alternating between a least-squares step and an eigenvalue 
thresholding step. The discretized Lasso-based algorithm has even lower computational complexity, 
consisting of iterations based upon the FFT and simple linear time soft-thresholding. 



1.1 Outline and summary of results 

We describe here our approach to a general sparse denoising problem and later specialize these 
results to line spectral estimation. The denoising problem is obtaining an estimate x of the signal 
x* from y = x* + w, where w is additive noise. We make the structural assumption that x* is a 
sparse non- negative combination of points from an arbitrary, possibly infinite set A C C ra . This 
assumption is very expressive and generalizes many notions of sparsity |12| . The atomic norm 
introduced in [12] , is a penalty function specially catered to the structure of A as we shall examine 
in depth in next section, and is defined as: 

\\x\\a = inf {t > | x 6 t conv(^l)} . 
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where conv(_4) is the convex hull of points in A. We analyze the denoising performance of an 
estimate that uses the atomic norm to encourage sparsity in A. 



Atomic norm denoising. In Section [2j we characterize the performance of the estimate x that 
solves 

1, .., , , . 

minimize —\\x — y|| 2 + t||x||^. (1-1) 

where r is an appropriately chosen regularization parameter. We provide an upper bound on the 
MSE when the noise statistics are known. Before we state the theorem, we note that the dual norm 
||" 11^4 > corresponding to the atomic norm, is given by 

||z||^ = sup (z, a), 
aeA 

where {x, z) = Re(z*x) denotes the real inner product. 

Theorem 1. Suppose we observe the signal y = x* + w where x* G C n is a sparse nonnegative com- 
bination of points in A. The estimate x of x* given by the solution of the atomic soft thresholding 
problem (1.1) with r > E||w||^ has the expected (per-element) MSE 

1 r 

-E||x-x*||| < -||s*|U 
n n 

This theorem implies that when E||io||^ is o(n), the estimate x is consistent. 



Choosing the regularization parameter. Our lower bound on r is in terms of the expected 
dual norm of the noise process w, equal to 

^IIHI!a = E[sup (it?, a}]. 

That is, the optimal r and achievable MSE can be estimated by studying the extremal values of 
the stochastic process indexed by the atomic set A. 



Denoising line spectral signals. After establishing the abstract theory, we specialize the results 
of the abstract denoising problem to line spectral estimation in Section [3j Consider the continu- 
ous time signal x*(t),t £ M with a line spectrum composed of k unknown frequencies u*, . . . ,u)t 
bandlimited to [— W, W]. Then the Nyquist samples of the signal are given by 

k 

< ■■= (w) = E c ~!^ mft ,m = 0,...,n-l (1.2) 
i=i 

where c*, . . . , c* k are unknown complex coefficients and /* = ^7 for I = 1, . . . , k are the normalized 
frequencies. So, the vector x* = [xq ■ ■ ■ x*_ 1 ] T G C n can be written as a non-negative linear 
combination of k points from the set of atoms 

A=[e a ^[l e l2 * f ••• e 12 ^™- 1 ^] 7 , fe [0,1], [0,1]}. 

The set A can be viewed as an infinite dictionary indexed by the continuously varying parameters 
/ and (p. When the number of observations, n, is much greater than k, x* is A;-sparse and thus line 
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spectral estimation in the presence of noise can be thought of as a sparse approximation problem. 
The regularization parameter for the strongest guarantee in Theorem [T] is given in terms of the 
expected dual norm of the noise and can be explicitly computed for many noise models. For 
example, when the noise is Gaussian, we have the following theorem for the MSE: 

Theorem 2. Assume x* G C n is given by x* m = Y2i=i c*e t2nm f* for some unknown complex 
numbers c*, . . . , c£, unknown normalized frequencies /*, ...,/£ E [0,1] and w G M(0, a 2 I n ). Then 
the estimate x of x* obtained from y = x* + w given by the solution of atomic soft thresholding 
problem (1.1) with r = <J\Jn log(n) has the asymptotic MSE 



iE||x-*1|!<<J33£ 
n V n 

It is instructive to compare this to the trivial estimator x = y which has a per-element MSE 
of a 2 . In contrast, Theorem 2 guarantees that AST produces a consistent estimate when k = 
o{\Jn/ log(ra)). 



Computational methods. We show in Section 3.1 that ( 1.1 ) for line spectral estimation can be 
reformulated as a semidefinite program and can be solved on moderately sized problems via semidef- 
inite programming. We also show that we get similar performance by discretizing the problem and 
solving a Lasso problem on a grid of a large number of points using standard i\ minimization soft- 
ware. Our discretization results justify the success of Lasso for frequency estimation problems (see 



for instance, 23-25]), even though many of the common theoretical tools for compressed sensing do 
not apply in this context. In particular, our measurement matrix does not obey RIP or incoherence 
bounds that are commonly used. Nonetheless, we are able to determine the correct value of the 
regularization parameter, derive estimates on the MSE, and obtain excellent denoising in practice. 

Localizing the frequencies using the dual problem. The atomic formulation not only offers 
a way to denoise the line spectral signal, but also provides an efficient frequency localization method. 



After we obtain the signal estimate x by solving (1.1), we also obtain the solution z to the dual 
problem as z = y — x. As we shall see in Corollary 1, the dual solution z both certifies the optimality 
of x and reveals the composing atoms of x. For line spectral estimation, this provides an alternative 
to polynomial interpolation for localizing the constituent frequencies. 

Indeed, when there is no noise, Candes and Fernandez- Granda showed the dual solution recovers 
these frequencies exactly under mild technical conditions [26] . This frequency localization technique 
is later extended in [27 to the random undersampling case to yield a compressive sensing scheme 



that is robust to basis mismatch. When there is noise, numerical simulations show that the atomic 



norm minimization problem (1.1) gives approximate frequency localization. 



Experimental results. A number of Prony-like techniques have been devised that are able to 
achieve excellent denoising and frequency localization even in the presence of noise. Our experi- 
ments in Section [5] demonstrate that our proposed estimation algorithms outperform Matrix Pencil, 
MUSIC and Cadzow's methods. Both AST and the discretized Lasso algorithms obtain lower MSE 
compared to previous approaches, and the discretized algorithm is much faster on large problems. 
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2 Abstract Denoising with Atomic Norms 



The foundation of our technique consists of extending recent work on atomic norms in linear inverse 



problems in 12 . In this work, the authors describe how to reconstruct models that can be expressed 
as sparse linear combinations of atoms from some basic set A. The set A can be very general and 
not assumed to be finite. For example, if the signal is known to be a low rank matrix, A could be 
the set of all unit norm rank-1 matrices. 

We show how to use an atomic norm penalty to denoise a signal known to be a sparse nonnegative 
combination of atoms from a set A. We compute the mean-squared-error for the estimate we thus 
obtain and propose an efficient computational method. 

Definition 1 (Atomic Norm). The atomic norm ||-||,4 of A is the Minkowski functional (or the 
gauge function) associated with conv(^l) (the convex hull of .A): 

\\x\\ A = inf {t > | x G tconv(A)} . (2.1) 

The gauge function is a norm if conv(„4) is compact, centrally symmetric, and contains a ball 
of radius e around the origin for some e > 0. When A is the set of unit norm 1-sparse elements 
in C n , the atomic norm ||-||„4 is the i\ norm (l3| . Similarly, when A is the set of unit norm rank-1 
matrices, the atomic norm is the nuclear norm |14|. In [12], the authors showed that minimizing the 
atomic norm subject to equality constraints provided exact solutions of a variety of linear inverse 
problems with nearly optimal bounds on the number of measurements required. 

To set up the atomic norm denoising problem, suppose we observe a signal y = x* + w and 
that we know a priori that x* can be written as a linear combination of a few atoms from A. One 
way to estimate x* from these observations would be to search over all short linear combinations 
from A to select the one which minimizes \\y — x\\2- However, this could be formidable: even if the 
set of atoms is a finite collection of vectors, this problem is the NP-hard SPARSEST VECTOR 
problem |28|. 



On the other hand, the problem (1.1) is convex, and reduces to many familiar denoising strate- 



gies for particular A. The mapping from y to the optimal solution of (1.1) is called the proximal 
operator of the atomic norm applied to y, and can be thought of as a soft thresholded version of 
y. Indeed, when A is the set of 1-sparse atoms, the atomic norm is the ^i-norm, and the proximal 
operator corresponds to soft-thresholding y by element-wise shrinking towards zero [29]. Similarly, 
when A is the set of rank-1 matrices, the atomic norm is the nuclear norm and the proximal 
operator shrinks the singular values of the input matrix towards zero. 



We now establish some universal properties about the problem (1.1). First, we collect a simple 
consequence of the optimality conditions in a lemma: 



Lemma 1 (Optimality Conditions), x is the solution of (1.1) if and only if 

ft) \\y - x \\*a ^ r > fa) (y -x,x) = t\\x\\ a . 



The dual atomic norm is given by 



which implies 



\z\\* A = sup (x,z), (2.2) 
IMU<1 



(x,z) < \\x\\ A \\z\\ A . (2.3) 
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The supremum in (2.2) is achievable, namely, for any x there is a z that achieves equality. Since 
A contains all extremal points of {x : \\x\\a < 1}, we are guaranteed that the optimal solution will 
actually lie in the set A: 

\\z\\* A = sup(a,z). (2.4) 

The dual norm will play a critical role throughout, as our asymptotic error rates will be in 
terms of the dual atomic norm of noise processes. The dual atomic norm also appears in the dual 



problem of ( 1.1 ) 



Lemma 2 (Dual Problem). The dual problem of (1.1) is 

maximize * - 
z 2 

subject to \\z\\a < t. 

The dual problem admits a unique solution z due to strong concavity of the objective function. 
The primal solution x and the dual solution z are specified by the optimality conditions and there 
is no duality gap: 

(i) y = x + z, (ii) \\z\\ A < r, (Hi) (z, x) = t\\x\\ a . 

The proofs of Lemma [T] and Lemma [2] are provided in Appendix [A} A straightforward corollary 
of this Lemma is a certificate of the support of the solution to ( |1.1[ ). 

Corollary 1 (Dual Certificate of Support). Suppose for some S C A, z is a solution to the dual 
problem Q satisfying 

1. {z, a) = t whenever a £ S, 

2. \(z,a}\ < t if a S. 



Then, any solution x of (1.1) admits a decomposition x = X^ae5 c i a with \\ x \\a = Saes c a- 



Thus the dual solution z provides a way to determine a decomposition of x into a set of ele- 
mentary atoms that achieves the atomic norm of x. In fact, one could evaluate the inner product 
(z,a) and identify the atoms where the absolute value of the inner product is r. When the SNR 
is high, we expect that the decomposition identified in this manner should be close to the original 
decomposition of x* under certain assumptions. 

We are now ready to state a proposition which gives an upper bound on the MSE with the 
optimal choice of the regularization parameter. 



Proposition 1. If the regularization parameter r > the optimal solution x of (1.1) has the 

MSE 

-\\x-x*\\l < - (r||sc*|U - (x*,w)) < —\\x*\\a- (2.5) 
n n n 

Proof. 

\\x — = (x — x* , w — (y — x)) (2-6) 
= (x*, y — x) — {x*,w) + (x, w) — (x, y — x) 

< t\\x*\\ a - {x* t w) + (\\w\\* A - t)\\x\\ a (2.7) 
<(t+H\ a )\\x*\\ a + (\\w\\ a -t)\\x\\ a (2.8) 
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where for (2.7) we have used Lemma [I] and (2.3). The theorem now follows from (2.7) and (2.8) 



since r > ||w||V The value of the regularization parameter r to ensure the MSE is upper bounded 
thus, is H^H^. □ 

Example: Sparse Model Selection We can specialize our stability guarantee to Lasso [16] and recover 
known results. Let $ £ M" xp be a matrix with unit norm columns, and suppose we observe 
y = x* + w, where w is additive noise, and x* = <£c* is an unknown k sparse combination of columns 
of <!>. In this case, the atomic set is the collection of columns of <3? and — $, and the atomic norm is 



\ x \\a = mm {|| c l|i : x = ^c}. Therefore, the proposed optimization problem (1.1) coincides with the 



Lasso estimator 16 . This method is also known as Basis Pursuit Denoising |17| . If we assume that 
if is a gaussian vector with variance a 2 for its entries, the expected dual atomic norm of the noise 
term, ||u>||^ = ||$* t/^lloo is simply the expected maximum of p gaussian random variables. Using the 
well known result on the maximum of gaussian random variables 130] , we have E||u;||^ < o^Jl log(p). 



If x is the denoised signal, we have from Theorem [l] that if r = E||w;||^ = log(p), 

1 ir.1i - *ii2/ vgMg n *,. 

— E \\x — X \\n < o~— C 1, 

n n 

which is the stability result for Lasso reported in [3l] assuming no conditions on <E>. 
2.1 Accelerated Convergence Rates 

In this section, we provide conditions under which a faster convergence rate can be obtained for 
AST. 

Proposition 2 (Fast Rates). Suppose the set of atoms A is centrosymmetric and ||w||^ concen- 
trates about its expectation so that P(||w||^ > E ||n>||*4 + t) < 5(t). For 7 € [0,1], define the 
cone 

C 7 (x*, A) = cone({z : \\x* + z\\a < ||^*||yl + tII^IU})- 

Suppose 

(x* t A) :=inf : z G C~,(x*,A)\ (2.9) 



\ Z \\A 

is strictly positive for some 7 > E||w||^/r. Then 

||a-a;*|||< ^r^^o ( 2 -!0) 
11 112 ~ 7 2 7 (j;*,^) 2 v ' 

with probability at least 1 — 5{^fT — E||w[|^). 

Having the ratio of norms bounded below is a generalization of the Weak Compatibility criterion 



used to quantify when fast rates are achievable for the Lasso 32 . One difference is that we define 
the corresponding cone C 7 where 7 must be controlled in parallel with the tangent cones studied 
in 12 . There, the authors showed that the mean width of the cone Cq(x*,A) determined the 
number of random linear measurements required to recover x* using atomic norm minimization. In 
our case, 7 is greater than zero, and represents a "widening" of the tangent cone. When 7 = 1, the 
cone is all of M n or C n (via the triangle inequality), hence r must be larger than the expectation 
to enable our proposition to hold. 
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Proof. Since x is optimal, we have, 



§l|y-^lli + ' r PIU < 2 1 lly- x *ll2 + r ll x *IU 

Rearranging and using (2.3) gives 

< r||x*||^4 + H^ll^llx — x*||_4. 

Since \\w ||^ concentrates about its expectation, with probability at least 1 — 5{^fT — E||u>||^), we 
have < 7T and hence x — x* 6 C 7 . Using (2.6), if t > 



\\x - x*\\\ < (r + ||w||^)||x - x*\\a < 
So, with probability at least 1 — 8{^t — E||io||^): 



(1 + 7)t 

i4>Jx*,A) 



\X — X 2 



X — X 



„*l|2 



2^2 



< 



7 2 7 (z*,.A) 5 



□ 



The main difference between (2.10) and (2.5) is that the MSE is controlled by r rather than 



r||x*||^. As we will now see (2.10) provides minimax optimal rates for the examples of sparse 
vectors and low-rank matrices. 

Example: Sparse Vectors in Noise Let A be the set of signed canonical basis vectors in W 1 . In 
this case, conv(„4) is the unit cross polytope and the atomic norm coincides with the l\ norm, 
and the dual atomic norm is the norm. Suppose x* 6 W 1 and T := supp(x*) has cardinality k. 
Consider the problem of estimating x* from y = x* + w where w ~ A^(0, o~ 2 I n ). 

We show in the appendix that in this case (j)j(x*,A) > \^ ■ We also have tq = K\\w\\oo > 



cr-y/2 log(n). Pick r > 7 1 tq for some 7 < 1. Then, using our lower bound for ^ 7 in (2.10), we get 
a rate of 



1 



x — x 



*i|2 



o 



n 



o~ 2 k\og{n) 



(2.11) 



for the AST estimate with high probability. This bound coincides with the minimax optimal rate 



derived by Donoho and Johnstone [33] . Note that if we had used (2.5) instead, our MSE would 
have instead been O (^y^a 2 k log n||x*||2/n J , which depends on the norm of the input signal x*. 

Example: Low Rank Matrix in Noise Let A be the manifold of unit norm rank-1 matrices 
in C nxn . In this case, the atomic norm ||-||^, coincides with the nuclear norm ||-||*, and the 
corresponding dual atomic norm is the spectral norm of the matrix. Suppose X* G C™ xn has rank 
r, so it can be constructed as a combination of r atoms, and we are interested in estimating X* 
from Y = X* + W where W has independent JV(0, a 2 ) entries. 

We prove in the appendix that (f)y(X*,A) > ir^=. To obtain an estimate for r, we note that 



2V2r 



the spectral norm of the noise matrix satisfies ||W|| < 2y/n with high probability [34]. Substituting 
these estimates for r and 7 in (2.10), we get the minimax optimal MSE 



2.2 Expected MSE for Approximated Atomic Norms 



We close this section by noting that it may sometimes be easier to solve (1.1) on a different set A 
(say, an e-net of A instead of A. If for some M > 0, 

J ^~ 1 || a; ll v 4 — \\ x \\a — ll x ll.4 

holds for every x, then Theorem [T] still applies with a constant factor M. We will need the following 
lemma. 

Lemma 3. \\z\\^ < M\\z\\^ for every z iff M _1 ||x||^ < for every z. 

Proof. We will show the forward implication - the converse will follow since the dual of the dual 
norm is again the primal norm. By (2.3), for any x, there exists a z with \\z\\% < 1 and (x, z) = \\x\\^. 
So, 



M _1 ||a;|| t = M~ x (x,z) 



< M~ x \\z\\^\\x\U 

< \\ x \\a 



by Q) 

by assumption. 



□ 



Now, we can state the sufficient condition for the following proposition in terms of either the 
primal or the dual norm: 



Proposition 3. Suppose 
or equivalently 



\ Z \\*_A — \\ z \\a — M\\z\\j£ for every z, 



M 1 ||x||^- < ||x||,4 < ||a;|| 2 for every x 



A 



(2.12) 
(2.13) 



then under the same conditions as in Theorem [TJ 



Ie||s-x*||1<— \\x*\\a 
n n 



where x is the optimal solution for (1.1) with A replaced by A. 

Proof. By assumption, E([|u?||^) < r. Now, (2.12) implies E (lI'HI^ — T - Applying Theorem jlj 
and using (2.13), we get 

Mr, 



1 



E||5-a;*||! < ~\\ x *\\ J< — \\ X *\U 



n 



n 



n 



□ 



3 Application to Line Spectral Estimation 

Let us now return to the line spectral estimation problem, where we denoise a linear combination of 
complex sinusoids. The atomic set in this case consists of samples of individual sinusoids, af a £ C n , 
given by 

a M = e i2 ^ [1 e^f ■ ■ ■ e *M»-i)/] T . (3.1) 
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{ a f,<f> '■ f £ [0)1]) ^ [0)1]} forms an appropriate collection of atoms for 



The infinite set A 

x*, since x* in (1.2) can be written as a sparse nonnegative combination of atoms in A. In fact, 

x* = Eti c t a f!V = Ef=i KI«/r,^P where c\ = |c*|e i2 ^. 
The corresponding dual norm takes an intuitive form: 

n-l 

|v||^4 = sup(-u, af 0) = sup sup e l27T ^2~] v i t 
/e[o,i]</>e[o,i] 



swp(v,a,f, 
f,<f> 



-2ml f 



1=0 



sup 



n-l 

£ 



(3.2) 



In other words, ||w||^ is the maximum absolute value attained on the unit circle by the polynomial 
£ 1 — y YjI=o v lC l - Thus, in what follows, we will frequently refer to the dual polynomial as the 
polynomial whose coefficients are given by the dual optimal solution of the AST problem. 

3.1 SDP for Atomic Soft Thresholding 

In this section, we present a semidefinite characterization of the atomic norm associated with the 
line spectral atomic set A = {a/,</>|/ E [0, !],</>£ [0, 1]}. This characterization allows us to rewrite 



(1.1) as an equivalent semidefinite programming problem. 



Recall from (3.2) that the dual atomic norm of a vector v G C n is the maximum absolute value 
of a complex trigonometric polynomial V{f) = Yl?^ vie~ 2ml $ ' . As a consequence, a constraint on 



the dual atomic norm is equivalent to a bound on the magnitude of V(f): 

\\v\\* A <T^\v(f)\ 2 <r 2 yfe[o,i}. 

The function q(f) = t 2 — |V(/)| 2 is a trigonometric polynomial (that is, a polynomial in the 
variables z and z* with \z\ = 1). A necessary and sufficient condition for q(f) to be nonnegative 
is that it can be written as a sum of squares of trigonometric polynomials [18]. Testing if q is a 
sum of squares can be achieved via semidefinite programming. To state the associated semidefinite 
program, define the map T : C n — > C nxn which creates a Hermitian Toeplitz matrix out of its 
input, that is 

r Xl x 2 ... x n 



T(x) 



■"n-l 



%n—l 



Let T* denote the adjoint of the map T. Then we have the following succinct characterization 

Lemma 4. 35 , Theorem 4.24] For any given causal trigonometric polynomial V(f) = El^To 1 v i^~ 2ml ^ \ 
\V(f)\ < t if and only if there exists complex Hermitian matrix Q such that 

~Q 



T*{Q) = r z ei and 



>- 0. 



Here, e\ is the first canonical basis vector with a one at the first component and zeros elsewhere 
and v* denotes the Hermitian adjoint (conjugate transpose) of v. 

Using Lemma|4j we rewrite the atomic norm ||x||^ = sup^n*^ (x,v) as the following semidef- 
inite program: 



maximize^ 
subject to 



(x,v) 
T*(Q) 
Q v 

V* 1 



= ei 
>- 0. 



(3.3) 
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The dual problem of (3.3) (after a trivial rescaling) is then equal to the atomic norm of x: 



\ X \\A 



mm t)U 
subject to 



~T(u) x 

X* t 



>- 0. 



Therefore, the atomic denoising problem (1.1) for the set of trigonometric atoms is equivalent to 

i 



mimmizet jUia: 
subject to 



\x - 
T(u 

x* 



2 + §(i + U!) 

>- 0. 



2/112 i 2 
X 

t 



(3.4) 



The semidefinite program (3.4) can be solved by off-the-shelf solvers such as SeDuMi |36| and 



SDPT3 [37] . However, these solvers tend to be slow for large problems. For the interested reader, we 
provide a reasonably efficient algorithm based upon the Alternating Direction Method of Multipliers 
(ADMM) [20] in Appendix 



3.2 Choosing the regularization parameter 

The choice of the regularization parameter is dictated by the noise model and we show the optimal 
choice for white gaussian noise samples in our analysis. As noted in Theorem [TJ the optimal choice 
of the regularization parameter depends on the dual norm of the noise. A simple lower bound on 
the expected dual norm occurs when we consider the maximum value of n uniformly spaced points 
in the unit circle. Using the result of [30| , the lower bound whenever n > 5 is 

a J n log(n) — | log(47r log(ra)) . 

Using a theorem of Bernstein and standard results on the extreme value statistics of Gaussian 
distribution, we can also obtain a non-asymptotic upper bound on the expected dual norm of noise 
for n > 3: 

er ( 1 + - — -— ] J n log(n) + n log(47r log(n)) 
V log(n)/ 

(See Appendix [D] for a derivation of both the lower and upper bound). If we set the regularization 
parameter t equal to an upper bound on the expected dual atomic norm, i.e., 



a 1 + 



log(n) 



y/n\og(n) + n log(4-7r log(n)). 



(3.5) 



an application of Theorem [T] yields the asymptotic result in Theorem [2] 
3.3 Determining the frequencies 

As shown in Corollary [TJ the dual solution can be used to identify the frequencies of the primal 
solution. For line spectra, a frequency / G [0, 1] is in the support of the solution x of (1.1) if and 
only if 



\(z,af,<l>)\ 



n-l 
1=0 



-i2irlf 
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Figure 1: Frequency Localization using Dual Polynomial: The actual location of the 
frequencies in the line spectral signal x* G C 64 is shown in red. The blue curve is the dual 
polynomial obtained by solving (|2j) with y = x* + w where w is noise of SNR 10 dB. 



That is, / is in the support of x if and only if it is a point of maximum modulus for the dual poly- 
nomial. Thus, the support may be determined by finding frequencies / where the dual polynomial 
attains magnitude r. 

Figure [l] shows the dual polynomial for (1.1) with n = 64 samples and k = 6 randomly chosen 



frequencies. The regularization parameter r is chosen as described in Section 3.2 
A recent result by Candes and Fernandez- Granda 



[26] establishes that in the noiseless case, 
the frequencies localized by the dual polynomial are exact provided the minimum separation be- 
tween the frequencies is at least 4/n where n is the number of samples in the line spectral signal. 



Under similar separation condition, numerical simulations suggest that (1.1) achieves approximate 
frequency location in the noisy case. 



3.4 Discretization and Lasso 

When the number of samples is larger than a few hundred, the running time of our ADMM method 
is dominated by the cost of computing eigenvalues and is usually expensive |38|. For very large 
problems, we now propose using Lasso as an alternative to the semidefinite program (13 . 41) . To 



proceed, pick a uniform grid of N frequencies and form An = {< 



"m/N,(j> 



solve (1.1) on this grid, i.e., we solve the problem 



1 



minimize 



yh + r IMUiv- 



0<m<N-l}cA and 



(3.6) 



To see why this is to our advantage, define <3? be the n x N Fourier matrix with mth column 
a m/N,o- Then for any x 6 C n we have H^H^ = min{||c||i : x = $c}. So, we solve 



minimize — II $c • 
2" 



yh + r ll c lli- 



(3.7) 



for the optimal point c and set xn = or the first n terms of the N term discrete Fourier transform 
(DFT) of c. Furthermore, &*z is simply the N term inverse DFT of z £ C n . This observation 
coupled with Fast Fourier Transform (FFT) algorithm for efficiently computing DFTs gives a fast 
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method to solve ( |3.6| ), using standard compressed sensing software for £2 — £1 minimization, for 
example, SparSA |21|. 



Because of the relatively simple structure of the atomic set, the optimal solution x for (3.6) can 



be made arbitrarily close to (|3.4|) by picking ./V a constant factor larger than n. In fact, we show 

■e equivalent (See Appendix [C]) : 

x\\ AN <\\x\\ A <\\x\\ AN ,VxeC n (3.8) 



that the atomic norms on A and An are equivalent (See Appendix |C|) : 

2irn 



N 

Using Proposition [3] and (3.5), we conclude 



a 

„*l|2 



( ^"nt 1 ) \\^\\A\/n\og{n) +nlog(4^1og(n)) ^ ^ /log^ ||s*|U 



n(l-^) ^^V n 

Due to the efficiency of the FFT, the discretized approach has a much lower algorithmic com- 
plexity than either Cadzow's alternating projections method or the ADMM method described in 
Appendix [El which each require computing an eigenvalue decomposition at each iteration. Indeed, 



fast solvers for (3.7) converge to an e optimal solution in no more than l/y/e. iterations. Each 
iteration requires a multiplication by $ and a simple "shrinkage" step. Multiplication by <3? or 
requires 0(N log N) time and the shrinkage operation can be performed in time O(N). 

As we discuss below, this fast form of basis pursuit has been proposed by several authors. 
However, analyzing this method with tools from compressed sensing has proven daunting because 
the matrix $ is nowhere near a restricted isometry. Indeed, as N tends to infinity, the columns 
become more and more coherent. However, common sense says that a larger grid should give better 
performance, for both denoising and frequency localization! Indeed, by appealing to the atomic 
norm framework, we are able to show exactly this point: the larger one makes N, the closer one 
approximates the desired atomic norm soft thresholding problem. Moreover, we do not have to 
choose N to be too large in order to achieve nearly the same performance as the AST. 



4 Related Work 

The classical methods of line spectral estimation, often called linear prediction methods, are built 



upon the seminal interpolation method of Prony 39 . In the noiseless case, with as little as n = 2k 



measurements, Prony's technique can identify the frequencies exactly, no matter how close the 
frequencies are. However, Prony's technique is known to be sensitive to noise due to instability 
of polynomial rooting [40]. Following Prony, several methods have been employed to robustify 
polynomial rooting method including the Matrix Pencil algorithm flO] , which recasts the poly- 
nomial rooting as a generalized eigenvalue problem and cleverly uses extra observations to guard 
against noise. The MUSIC [8] and ESPRIT |9| algorithms exploit the low rank structure of the 
autocorrelation matrix. 



Cadzow 22 proposed a heuristic that improves over MUSIC by exploiting the Toeplitz structure 
of the matric of moments by alternately projecting between the linear space of Toeplitz matrices 
and the space of rank k matrices where k is the desired model order. Cadzow's technique is 



very similar [41] to a popular technique in time series literature 42,43 called Singular Spectrum 
Analysis [44] , which uses autocorrelation matrix instead of the matrix of moments for projection. 
Both these techniques may be viewed as instances of structured low rank approximation |45| which 
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exploit additional structure beyond low rank structure used in subspace based methods such as 
MUSIC and ESPRIT. Cadzow's method has been identified as a fruitful preprocessing step for linear 
prediction methods |46|. A survey of classical linear prediction methods can be found in |46l|47| 



and an extensive list of references is given in 11 



Most, if not all of the linear prediction methods need to estimate the model order by employing 
some heuristic and the performance of the algorithm is sensitive to the model order. In contrast, our 
algorithms AST and the Lasso based method, only need a rough estimate of the noise variance. In 
our experiments, we provide the true model order to Matrix Pencil, MUSIC and Cadzow methods, 
while we use the estimate of noise variance for AST and Lasso methods, and still compare favorably 
to the classical line spectral methods. 

In contrast to linear prediction methods, a number of authors [23-25 have suggested using 
compressive sensing and viewing the frequency estimation as a sparse approximation problem. For 
instance, |24| notes that the Lasso based method has better empirical localization performance than 
the popular MUSIC algorithm. However, the theoretical analysis of this phenomenon is complicated 
because of the need to replace the continuous frequency space by an oversampled frequency grid. 
Compressive sensing based results (see, for instance, |48| ) need to carefully control the incoherence 
of their linear maps to apply off-the-shelf tools from compressed sensing. It is important to note that 
the performance of our algorithm improves as the grid size increases. But this seems to contradict 
conventional wisdom in compressed sensing because our design matrix <E> becomes more and more 
coherent, and limits how fine we can grid for the theoretical guarantees to hold. 

We circumvent the problems in the conventional compresssive sensing analysis by directly work- 
ing in the continuous parameter space and hence step away from such notions as coherence, focussing 
on the geometry of the atomic set as the critical feature. By showing that the continuous approach 
is the limiting case of the Lasso based methods using the convergence of the corresponding atomic 
norms, we justify denoising line spectral signals using Lasso on a large grid. Since the original sub- 
mission of this manuscript, Candes and Fernandez- Granda 26 showed that our SDP formulation 
exactly recovers the correct frequencies in the noiseless case. 



5 Experiments 

We compared the MSE performance of AST, the discretized Lasso approximation, the Matrix 
Pencil, MUSIC and Cadzow's method. For our experiments, we generated k normalized frequencies 
/*, . . . , ft uniformly randomly chosen from [0, 1] such that every pair of frequencies are separated 



by at least l/2n. The signal x* G C n is generated according to (1.2) with k random amplitudes 
independently chosen from x 2 (l) distribution (squared Gaussian). All of our sinusoids were then 
assigned a random phase (equivalent to multiplying c* k by a random unit norm complex number). 
Then, the observation y is produced by adding complex white gaussian noise w such that the input 
signal to noise ratio (SNR) is —10, —5,0,5, 10, 15 or 20 dB. We compare the average MSE of the 
various algorithms in 10 trials for various values of number of observations (n = 64, 128, 256), and 
number of frequencies (k = n/4, n/8, n/16). 

AST needs an estimate of the noise variance a 2 to pick the regularization parameter according 



to (3.5). In many situations, this variance is not known to us a priori. However, we can construct 
a reasonable estimate for a when the phases are uniformly random. It is known that the auto- 
correlation matrix of a line spectral signal (see, for example Chapter 4 in [47] ) can be written as 



a sum of a low rank matrix and a 2 I if we assume that the phases are uniformly random. Since 
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Figure 2: MSE vs SNR plots: This graph compares MSE vs SNR for a subset of experi- 
ments with n = 128 samples. From left to right, the plots are for combinations of 8, 16, and 
32 sinusoids with amplitudes and frequencies sampled at random. 



the empirical autocorrelation matrix concentrates around the true expectation, we can estimate 
the noise variance by averaging a few smallest eigenvalues of the empirical autocorrelation matrix. 
In the following experiments, we form the empirical autocorrelation matrix using the MATLAB 
routine corrmtx using a prediction order m = n/3 and averaging the lower 25% of the eigenvalues. 
We used this estimate in equation (3.5) to determine the regularization parameter for both our 
AST and Lasso experiments. 

First, we implemented AST using the ADMM method described in detail in the Appendix. 
We used the stopping criteria described in |20j and set p = 2 for all experiments. We use the 
dual so lutio n z to determine the support of the optimal solution x using the procedure described in 
Section . 



3.3 



Once the frequencies fi are extracted, we ran the least squares problem minimize a ||[/a;— 
y\\ 2 where Uji = exp(i2-7rj/z) to obtain a debiased solution. After computing the optimal solution 
"opt j we returned the prediction x = U a op t ■ 

We implemented Lasso, obtaining an estimate x of x* from y by solving the optimization 
problem (3.6) with debiasing. We use the algorithm described in Section 3.4 with grid of N = 2 m 
points where m = 10, 11, 12, 13, 14 and 15. Because of the basis mismatch effect, the optimal c op t 
has significantly more non-zero components than the true number of frequencies. However, we 
observe that the frequencies corresponding to the non-zero components of c op t cluster around the 
true ones. We therefore extract one frequency from each cluster of non-zero values by identifying 
the grid point with the maximum absolute c op t value and zero everything else in that cluster. We 
then ran a debiasing step which solves the least squares problem minimize^ || — y\\ 2 where $5 
is the submatrix of $ whose columns correspond to frequencies identified from c op t- We return the 
estimate x = <&sPopt- We used the freely downloadable implementation of SpaRSA to solve the 
Lasso problem. We used a stopping parameter of 10 -4 , but otherwise used the default parameters. 

We implemented Cadzow's method as described by the pseudocode in [46] , the Matrix Pencil 
as described in [To] and MUSIC [8] using the MATLAB routine rootmusic. All these algorithms 
need an estimate of the number of sinusoids. Rather than implementing a heuristic to estimate k, 
we fed the true k to our solvers. This provides a huge advantage to these algorithms. Neither AST 
or the Lasso based algorithm are provided the true value of k, and the noise variance a 2 required 
in the regularization parameter is estimated from y. 

In Figure [2j we show MSE vs SNR plots for a subset of experiments when n = 128 time 
samples are taken to take a closer look at the differences. It can be seen from these plots that the 
performance difference between classical algorithms such as MUSIC and Cadzow with respect to 
the convex optimization based AST and Lasso is most pronounced at lower sparsity levels. When 
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Figure 3: (a) Plot of MSE vs SNR for Lasso at different grid sizes for a subset of experiments 
with n — 128, k — 16. (b) Lasso Frequency localization with n = 32, k = 4, SNR = 10 dB. 
Blue represents the true frequencies, while red are given by Lasso. For better visualization, we 
threshold the Lasso solution by 10~ 6 . 



the noise dominates the signal (SNR < dB), all the algorithms are comparable. However, AST 
and Lasso outperform the other algorithms in almost every regime. 

We note that the denoising performance of Lasso improves with increased grid size as shown in 
the MSE vs SNR plot in Figure|3ja). The figure shows that the performance improvement for larger 
grid sizes is greater at high SNRs. This is because when the noise is small, the discretization error 
is more dominant and finer gridding helps to reduce this error. Figures [3](a) and (b) also indicate 
that the benefits of increasing discretization levels are diminishing with the grid sizes, at a higher 
rate in the low SNR regime, suggesting a tradeoff among grid size, accuracy, and computational 
complexity. 

Finally, in Figure [3^b) , we provide numerical evidence supporting the assertion that frequency 
localization improves with increasing grid size. Lasso identifies more frequencies than the true 
ones due to basis mismatch. However, these frequencies cluster around the true ones, and more 
importantly, finer discretization improves clustering, suggesting over-discretization coupled with 
clustering and peak detection as a means for frequency localization for Lasso. This observation 
does not contradict the results of 49 where the authors look at the full Fourier basis (N = n) and 
the noise- free case. This is the situation where discretization effect is most prominent. We instead 
look at the scenario where N S> n. 

We use performance profiles to summarize the behavior of the various algorithms across all 
of the parameter settings. Performance profiles provide a good visual indicator of the relative 
performance of many algorithms under a variety of experimental conditions [50]. Let V be the set 
of experiments and let MSE s (p) be the MSE of experiment p 6 V using the algorithm s. Then the 
ordinate P S (P) of the graph at (5 specifies the fraction of experiments where the ratio of the MSE 
of the algorithm s to the minimum MSE across all algorithms for the given experiment is less than 

/3 ' 1 ' 6 '' V f^_ #{P^ V ■ MSE s (;p) </?min s MSE s (p)} 
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Figure 4: (a) Performance Profile comparing various algorithms and AST. (b) Performance 
profiles for Lasso with different grid sizes. 



From the performance profile in Figure |4^a), we see that AST is the best performing algorithm, 
with Lasso coming in second. Cadzow does not perform as well as AST, even though it is fed 
the true number of sinusoids. When Cadzow is fed an incorrect k, even off by 1, the performance 
degrades drastically, and never provides adequate mean-squared error. Figure |4^b) shows that the 
denoising performance improves with grid size. 



6 Conclusion and Future Work 

The Atomic norm formulation of line spectral estimation provides several advantages over prior 
approaches. By performing the analysis in the continuous domain we were able to derive simple 
closed form rates using fairly straightforward techniques. We only grid the unit circle at the very 
end of our analysis and determine the loss incurred from discretization. This approach allowed us 
to circumvent some of the more complicated theoretical arguments that arise when using concepts 
from compressed sensing or random matrix theory. 

This work provides several interesting possible future directions, both in line spectral estimation 
and in signal processing in general. We conclude with a short outline of some of the possibilities. 



Fast Rates Determining checkable conditions on the cones in Section 2.1 for the atomic norm 
problem is a major open problem. Our experiments suggest that when the frequencies are spread 
out, AST performs much better with a slightly larger regularization parameter. This observa- 
tion was also made in the model-based compressed sensing literature |48| . Moreover, Candes and 
Fernandez Granda also needed a spread assumption to prove their theories. 

This evidence together suggests the fast rate developed in Section 2A_ may be active for signals 
with well separated frequenies. Determining concrete conditions on the signal x* that ensure this 
fast rate require techniques for estimating the parameter <f> in (2.9). Such an investigation should be 
accompanied by a determination of the minimax rates for line spectral estimation. Such minimax 
rates would shed further light on the rates achievable for line spectral estimation. 



Moments Supported Inside the Disk Our work also naturally extends to moment problems 
where the atomic measures are supported on the unit disk in the complex plane. These problems 
arise naturally in controls and systems theory and include model order reduction, system identifi- 
cation, and control design. Applying the standard program developed in Section [2] provides a new 
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look at these classic operator theory problems in control theory. It would be of significant impor- 
tance to develop specialized atomic-norm denoising algorithms for control theoretic problems. Such 
an approach could yield novel statistical bounds for estimation of rational functions and %oo-norm 
approximations . 

Other Denoising Models Our abstract denoising results in Section [2] apply to any atomic mod- 
els and it is worth investigating their applicability for other models in statistical signal processing. 
For instance, it might be possible to pose a scheme for denoising a signal corrupted by multipath 
reflections. Here, the atoms might be all time and frequency shifted versions of some known signal. 
It remains to be seen what new insights in statistical signal processing can be gleaned from our 
unified approach to denoising. 
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A Optimality Conditions 

A. 0.1 Proof of Lemma Q] 

Proof. The function f{x) = ^\\y — + r||x||^ is minimized at x, if for all a £ (0, 1) and all x, 

f(x + a(x-x)) > fix) 

or equivalently, 



a 1 t [\\x + a(x — x)\\a — \\x\\a) > (y — x, x — x) a\\x — 



(A.l) 



Since ||-||^ is convex, we have 

IMU - \\ x \\a - a ~ x (P + «( x - £)IU ~ II^IU) 



for all x and for all a G (0, 1). Thus, by letting a — > in ( A.l ), we note that x minimizes f(x) only 
if, for all x, 



t(\\x\\a ~ \\x\U) > (y~x,x- x). 



(A.2) 



However if (A.2) holds, then, for all x 



1 



1 



y - x h + T \\ x \U > 7,\\y - x + ( x - x )h + (y - x ,x - x ) + T \\ x \U 



implying fix) > f(x). Thus, (A.2) is necessary and sufficient for x to minimize f(x) 



Note. The condition (A.2) simply says that r 1 (y — x) is in the subgradient of ||-||.a at x or 
equivalently that G df(x). 



We can rewrite (A.2) as 

t\\x\\a ~ (y - x, x) < inf {r||x|U - (y - x, x)} 



But by definition of the dual atomic norm, 

SUp{(z,x) - \\x\\ A } = 







I*I&<1 



oo otherwise. 



(A.3) 



(A.4) 



where Ia{') is the convex indicator function. Using this in (A.3), we find that x is a minimizer if 
and only if \\y — x\\*^ < r and (y — x,x) > r||x||^. This proves the theorem. □ 

A. 0.2 Proof of Lemma [2] 



Proof. We can rewrite the primal problem (1.1) as a constrained optimization problem: 

• • • 112 „ 

minimize —\\y — x\\ 2 + \\u\\a 
x,u 2 

subject to u = x. 
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Now, we can introduce the Lagrangian function 



L(x,u,z) = - \\y - x\\l + ||«|U + (z,x 



so that the dual function is given by 

g{z) = mfL(x,u,z) = inf ( \ \\y - x\\\ + (z,x) ) + inf (t\\u\\ a - (z,u)) 

x,u x \ Z I u 

= 2 [\\y\\l - h - 41) - ^w.www^t}^)- 

where the first infimum follows by completing the squares and the second infimum follows from 
(A.4). Thus the dual problem of maximizing g(z) can be written as in Q. 

The solution to the dual problem is the unique projection z of y on to the closed convex set 
C = {z : \\z\\^ < t}. By projection theorem for closed convex sets, z is a projection of y onto C if 
and only if z G C and (z — z,y — z) < for all z £ C, or equivalently if {z, y — z) > sup 2 (z, y — z) = 
T \\y ~ 4a- These conditions are satisfied for z = y — x where x minimizes f(x) by Lemma [T] Now 
the proof follows by the substitution z = y — x in the previous lemma. The absence of duality gap 
can be obtained by noting that the primal objective function at x, 

/(£) = \h ~ £ ll2 + (z,x) = + (z,x) = g(z). 

□ 



B Fast Rate Calculations 

We first prove the following 

Proposition 4. Let A = {±ei, . . . , ±e n }, be the set of signed canonical unit vectors inR n . Suppose 
x* £ R n has k nonzeros. Then <j) 7 (x*,A) > 

Proof. Let z £ C 7 (x*,^4). For some a > we have, 

+ az\\i < \\x*\\i + 7||a2;||i 

In the above inequality, set z = zt + zt c where zt are the components on the support of T and 
zt c are the components on the complement of T. Since x* + zt and zt c have disjoint supports, we 
have, 

|| x* + Q!Zt||i + a||-ZT c ||i < ||s*|| i + 7||o!£t||i + 7||«^T c ||i • 
This inequality implies 

II II <r 1+ ^ll II 

ZT C l S FT 1 

1-7 

that is, z satisfies the null space property with a constant of jz?- Thus, 

M I, 2 I, I, 2 ^ M I, 

\\ z i < ; k i ^ ^ \\ z h 

1 — 7 1 — 7 

This gives the desired lower bound. □ 
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Now we can turn to the case of low rank matrices. 

Proposition 5. Let A be the manifold of unit norm rank-1 matrices in C nxn . Suppose X* € 
has rank r. Then (j)^(X*,A) > t^a^- 



Proof. Let UT,V H be a singular value decomposition of X* with U G C nxr , V £ C nxr and S G C rxr . 
Define the subspaces 

T = {[/X + YF^ : X,Y£C nxr } 
T = {UMV H : M £ C rxr } 

and let Vt , Vt-> and T T ±_ be projection operators that respectively map onto the subspaces To, T, 
and the orthogonal complement of T. Now, if Z £ C^(X*,A), then for some a > 0, we have 

\\X* + aZ\\* < \\X*\\ m +'ya\\Z\\* < + ia \\V T {Z)\U + -fa\\V T ±{Z)\\^ (B.l) 

Now note that we have 

\\X* + aZ\\* > \\X* + aV To {Z)\U + a\\V T ±(Z)\\* 
Substituting this in ( |B.l ), we have, 



\\X* + aV To (Z)\U + a\\V T ±(Z)\U < \\X*\\* + 7 a\\V T (Z)\U + -ya\\V T ±(Z)\\*. 
Since \\V To (Z)\U < \\V T {Z)\U, we have 

WPtAZ)\U< \^\\v t (z)\u. 

1-7 

Putting these computations together gives the estimate 

2 2\/2r 1\/2r 

\\z\u < \\r T (z)\u + \\v T 4z)\U < i — \\Pt{Z)\\ < ^J_\\ Vt {Z)\\f < Y-hz\\ F . 

1 — 7 1 — 7 1 — 7 

That is, we have <^> 7 (X*,„4.) > ^=2= as desired. □ 

C Approximation of the Dual Atomic Norm 

This section furnishes the proof that the atomic norms induced by A and An are equivalent. Note 
that the dual atomic norm of w is given by 

\\w\\^ = y/n sup W n (e i27rf ) . (C.l) 
/e[0,i] 

i.e., the maximum modulus of the polynomial W n defined by 

n-l 



W n (e i2 ^) = ~ Y\ w m e~ i2 ™f. (C.2) 

. / 71 Z. J 



n 

m=0 
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Treating W n as a function of /, with a slight abuse of notation, define 



\W„ 



sup 

/6[0,1] 



WJe 



%2-nfs 



We show that we can approximate the maximum modulus by evaluating W n in a uniform grid of 
iV points on the unit circle. To show that as N becomes large, the approximation is close to the 
true value, we bound the derivative of W n using Bernstein's inequality for polynomials. 

Theorem 3 (Bernstein, See, for example [51]). Let p n be any polynomial of degree n with complex 
coefficients. Then, 

sup < n sup 

I*l<i l*l<i 

Note that for any /i, fa E [0,1], we have 



W n (e i2nfl ) - W n (e i2lTf2 



< 



e i2wf! _ e i2irf2 



= 2|sin(2vr(/ 1 -/ 2 ))|||W;| 
<4tt(/ 1 -/ 2 )|K|| 00 

<47rn(/i-/ 2 )||W„||oo, 



where the last inequality follows by Bernstein's theorem. Letting s take any of the N values 
, jj . . . , ^jf^, we see that, 







\W, 



n 1 1 < max 

m=0,...,JV-l 



2irn 



IWn 



Since the maximum on the grid is a lower bound for maximum modulus of W n , we have 



max 
m=0,...,iV-l 



W n ( e i27rm / N 



< WWr, 



< 1 



27rn 



< 1 + 



N 
Aim 



max 
m=0,...,iV-l 



, max 

N ) m=0,...,iV-l 



W n (e !2 ™ /N ) 



(C.3) 



(C.4) 



Thus, for every w, 



or equivalently, for every x, 



\w\\* A <\\w\\* A < [1 



2irn 



w 



A* 



(C.5) 



2nn 



ml An — \\ x \\a — \\ x \\Ai> 



(C.6) 
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D Dual Atomic Norm Bounds 



This section derives non-asymptotic upper and lower bounds on the expected dual norm of gaussian 
noise vectors, which are asymptotically tight upto log log factors. Recall that the dual atomic norm 
of w is given by y^suP/ep,!] where 



W f 



1 n—l 

-T 



w m e 



-i2irmf 



m=0 



The covariance function of Wf is 



n-1 



E [ W h W *f 2 ] =lYl exp(2vrm(/ 1 - / 2 )) = e 



t(«-1)(/i-/ 2 ) 



m=0 
, n-l 



sin(n7r(/i - f 2 )) 
nsin(vr(/ 2 - / 2 ))' 



Thus, the n samples {^ m /„}^_ are uncorrelated and thus independent because of their joint 
gaussianity. This gives a simple non-asymptotic lower bound using the known result for maximum 
value of n independent gaussian random variables [30] whenever n > 5: 



E 


sup \Wt\ 


> E 






.teT 




771 



= max_ i Re (W m/n ) 



log(n) 



log log(ra)+log(47r) 
2 



We will show that the lower bound is asymptotically tight neglecting log log terms. Since the 
dual norm induced by An approximates the dual norm induced by A, (See[C]), it is sufficient to 
compute an upper bound for ||w||^ . Note that \Wf\ 2 has a chi-square distribution since Wf is 
a Gaussian process. We establish a simple lemma about the maximum of chi-square distributed 
random variables. 

Lemma 5. Let xi, . . . ,xn be complex gaussians with unit variance. Then, 



E 



max \Xi 

Ki<N 



< ^log(iV) + 1. 



Proof. Let x\, . . . ,xn be complex Gaussians with unit variance: E[|xj| 2 ] = 1. Note that 2|xj| 2 is a 
chi-squared random variable with two degrees of freedom. Using Jensen's inequality, also observe 
that 

1/2 

(D.l) 

Now let z\ , . . . , z n be chi-squared random variables with 2 degrees of freedom. Then we have 



dt 
dt 











1/2 




E 


max \xA 

l<i<N 


< E 


max \xi 2 

l<i<N 


^7S E 


max 2 \xi 2 

l<i<N 











E 


max Zi 


= P 


max Zi > t 




l<i<N 


Jo 


l<i<N 



p 



max Zi > t 

Ki<N 



/•oo 

<5 + N J P[zi>t] dt 

/•oo 

=5 + N exp(-t/2)di 
=5 + 2iYexp(-5/2) 
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Setting 5 = 21og(iV) gives E [maxi<j<7v Zj\ < 2 log N + 2. Plugging this estimate into (D.l) gives 
E [maxi<j<Ar |xj|] < ^log N + 1. □ 



Using Lemma [5j we can compute 



n max 
m=0,...,JV-l 



W n (e i2 ™/ N ^ I < av/n(logiV 



+ D 



Plugging in N = 4-7rnlog(n) and using (C.l) and (C.4) establishes a tight upper bound. 



E Alternating Direction Method of Multipliers for AST 



A thorough survey of the ADMM algorithm is given in [20] . We only present the details essential 
to the implementation of atomic norm soft thresholding. To put our problem in an appropriate 
form for ADMM, rewrite (3.4) as 

minimizei jUia;i z \\\x - y\\l + ^(t + ui) 

Tiu] x 
subject to Z = \ 

I x t 

and dualize the equality constraint via an Augmented Lagrangian: 

1 r 

£ p (t,u,x,Z, A) = -\\x - y\\l + -(t + m)+ 

A,Z 



T(u) x 


B 


z - 


T(u) x 




x* t_ 






x* t 





ADMM then consists of the update steps: 



(t l+1 ,u l+1 ,x l+1 ) «- argmmC p (t,u,x,Z l , A 1 ) 



Z 



l+i 



arg mm£ (t 



u lJr ',x~ 



t,u,x 

l+l „,l+l J+l 



Z, A 



A /+i ^_ A i + . z 



l+i 



T{u l+1 ) x l+r 
x i+i* t i+i 



The updates with respect to t, x, and u can be computed in closed form: 



t l+1 — Z l n+l ^ n+l + \ A l n 



T 

; o \ i 



Ip 



-{y + 2pz[ + 2X l 1 



2p+l 

= w (t*(z 1 + a 1 / p ) 



T 

2p 



ei 



Here W is the diagonal matrix with entries 

Wu = 



i 



2(n-i+l) 



i = 1 
i > 1 
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and we introduced the partitions: 



7 l 
z 

l * 

A 



7} 



and A 



A 



A' 
°* 

M M-i+l,n+l 



The Z update is simply the projection onto the positive definite cone 



Z 



l+i 



are; mm 



Z 



T(u l+1 ) 



d+l 



+ A'/P 



(E.l) 



Projecting a matrix Q onto the positive definite cone is accomplished by forming an eigenvalue 
decomposition of Q and setting all negative eigenvalues to zero. 

To summarize, the update for (t, u, x) requires averaging the diagonals of a matrix (which is 
equivalent to projecting a matrix onto the space of Toeplitz matrices), and hence operations that 
are 0(n). The update for Z requires projecting onto the positive definite cone and requires 0(n 3 ) 
operations. The update for A is simply addition of symmetric matrices. 

Note that the dual solution z can be obtained as z = y — x from the primal solution x obtained 
from ADMM by using Lemma [2] 
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